ABSTRACT

Given a collection of objects and an associated similarity 
measure, the all-pairs similarity search problem asks us to 
find all pairs of objects with similarity greater than a certain 
user-specified threshold. Locality-sensitive hashing (LSH) 
based methods have become a very popular approach for 
this problem. However, most such methods only use LSH for 
the first phase of similarity search - i.e. efficient indexing for 
candidate generation. In this paper, we present BayesLSH, 
a principled Bayesian algorithm for the subsequent phase of 
similarity search - performing candidate pruning and simi- 
larity estimation using LSH. A simpler variant, BayesLSH- 
Lite, which calculates similarities exactly, is also presented. 
BayesLSH is able to quickly prune away a large majority 
of the false positive candidate pairs, leading to significant 
speedups over baseline approaches. For BayesLSH, we also 
provide probabilistic guarantees on the quality of the out- 
put, both in terms of accuracy and recall. Finally, the qual- 
ity of BayesLSH's output can be easily tuned and does not 
require any manual setting of the number of hashes to use 
for similarity estimation, unlike standard approaches. For 
two state-of-the-art candidate generation algorithms, All- 
Pairs [3] and LSH, BayesLSH enables significant speedups, 
typically in the range 2x-20x for a wide variety of datasets. 

1. INTRODUCTION 

Similarity search is a problem of fundamental importance 
for a broad array of fields, including databases, data mining 
and machine learning. The general problem is as follows: 
given a collection of objects D with some similarity measure 
s defined between them and a query object q, retrieve all 
objects from D that are most similar to q according to the 
similarity measure s. The user may be either interested in 
the top-k most similar objects to q, or the user may want
all objects x such that s(x,q) > t, where t is the similar- 
ity threshold. A more specific version of similarity search is 
the All Pairs similarity search problem, where there is no 
explicit query object, but instead the user is interested in 
all pairs of objects with similarity greater than some threshold.
The number of applications even for the more specific
all pairs similarity search problem is impressive: cluster- 
ing [20], semi-supervised learning [27], information retrieval 
(including text, audio and video), query refinement [3], near-duplicate
detection [24], collaborative filtering, link prediction
for graphs [16], and 3-D scene reconstruction [1] among
others. In many of these applications, approximate solu- 
tions with small errors in similarity assessments are accept- 
able if they can buy significant reductions in running time 
e.g. in web-scale clustering [5] [20], information retrieval [8],
near-duplicate detection for web crawling [18] [10] and graph
clustering [22].

Roughly speaking, similarity search algorithms can be di- 
vided into two main phases - candidate generation and can- 
didate verification. During the candidate generation phase, 
pairs of objects that are good candidates for having simi- 
larity above the user-specified threshold are generated using 
one or another indexing mechanism, while during candidate 
verification, the similarity of each candidate pair is verified 
against the threshold, in many cases by exact computation 
of the similarity. The traditional indexing structures used 
for candidate generation were space-partitioning approaches 
such as kd-trees and R-trees, but these approaches work 
well only in low dimensions (less than 20 or so [7]). An im- 
portant breakthrough was the invention of locality-sensitive 
hashing [11] [7], where the idea is to find a family of hash
functions such that for a random hash function from this 
family, two objects with high similarity are very likely to be 
hashed to the same bucket. One can then generate candidate 
pairs by hashing each object several times using randomly 
chosen hash functions, and generating all pairs of objects 
which have been hashed to the same bucket by at least one 
hash function. Although LSH is a randomized, approximate 
solution to candidate generation, similarity search based on 
LSH has nonetheless become immensely popular because it 
provides a practical solution for high dimensional applica- 
tions along with theoretical guarantees for the quality of the 
approximation [2]. 

In this article, we show how LSH can be exploited for 
the phase of similarity search subsequent to candidate gen- 
eration i.e. candidate verification and similarity computa- 
tion. We adopt a principled Bayesian approach that allows 
us to reason about the probability that a particular pair 
of objects will meet the user-specified threshold by inspect- 
ing only a few hashes of each object, which in turn allows 
us to quickly prune away unpromising pairs. Our Bayesian 
approach also allows us to estimate similarities to a user- 



specified level of accuracy without requiring any tuning of 
the number of hashes, overcoming a significant drawback of 
standard similarity estimation using LSH. We develop two 
algorithms, called BayesLSH and BayesLSH-Lite, where 
the former performs both candidate pruning and similarity 
estimation, while the latter only performs candidate prun- 
ing and computes the similarities of unpruned candidates 
exactly. Essentially, BayesLSH provides a way to trade-off 
accuracy for speed in a controlled manner. Both BayesLSH 
and BayesLSH-Lite can be combined with any existing candidate
generation algorithm, such as AllPairs [3] or LSH.
Concretely, BayesLSH provides the following probabilistic 
guarantees: 

Given a collection of objects D, an associated similarity
function s(., .), and a similarity threshold t; recall parameter
$\epsilon$ and accuracy parameters $\delta$, $\gamma$; return pairs of objects (x, y)
along with similarity estimates $\hat{s}_{x,y}$ such that:

1. $\Pr[s(x, y) > t] > \epsilon$ i.e. each pair with a greater than
$\epsilon$ probability of being a true positive is included in the
output set.

2. $\Pr[|\hat{s}_{x,y} - s(x, y)| \geq \delta] < \gamma$ i.e. each associated similarity
estimate is accurate up to $\delta$-error with probability
$> 1 - \gamma$.

With BayesLSH-Lite, the similarity calculations are ex- 
act, so there is no need for guarantee 2, but guarantee 1 
from above stays. We note that the parameterization of 
BayesLSH is intuitive - the desired recall can be controlled 
using $\epsilon$, while $\delta$, $\gamma$ together specify the desired level of accuracy
of similarity estimation.

The advantages of BayesLSH are as follows: 

1. The general form of the algorithm can be easily adapted 
to work for any similarity measure with an associated 
LSH family (see Section [2] for a formal definition of 
LSH). We demonstrate BayesLSH for Cosine and Jac- 
card similarity measures, and believe that it can be 
adapted to other measures with LSH families, such as 
kernel similarities. 

2. There are no restricting assumptions about the specific 
form of the candidate generation algorithm; BayesLSH 
complements progress in candidate generation algo- 
rithms. 

3. For applications which already use LSH for candidate 
generation, it is a natural fit since it exploits the hashes 
of the objects for candidate pruning, further amortiz- 
ing the costs of hashing. 

4. It works for both binary and general real-valued vec- 
tors. This is a significant advantage because recent 
progress in similarity search has been limited to binary
vectors [24] [26].

5. Parameter tuning is easy and intuitive; the only parameters
are $\gamma$, $\delta$ and $\epsilon$, each of which, as we have
seen, directly controls the quality of the output result.
In particular, there is no need for manually tuning the 
number of hashes, as one needs to with standard sim- 
ilarity estimation using LSH. 

We perform an extensive evaluation of our algorithms and 
comparison with state-of-the-art methods, on a diverse array 
of 6 real datasets. We combine BayesLSH and BayesLSH- 
Lite with two different candidate generation algorithms All- 
Pairs [3] and LSH, and find significant speedups, typically 



in the range 2x-20x over baseline approaches (see Table [2]). 
BayesLSH is able to achieve the speedups primarily by being 
extremely effective at pruning away false positive candidate 
pairs. To take a typical example, BayesLSH is able to prune 
away 80% of the input candidate pairs after examining only 
8 bytes worth of hashes per candidate pair, and 99.98% of 
the candidate pairs after examining only 32 bytes per pair. 
Notably, BayesLSH is able to do such effective pruning with- 
out adversely affecting the recall, which is still quite high, 
generally at 97% or above. Furthermore, the accuracy of 
BayesLSH's similarity estimates is much more consistent as 
compared to the standard similarity approximation using 
LSH, which tends to produce very error-ridden estimates 
for low similarities. Finally, we find that parameter tun- 
ing for BayesLSH is intuitive and works as expected, with 
higher accuracies and recalls being achieved without leading 
to undue slow-downs. 

2. BACKGROUND 

Following Charikar [5], we define a locality-sensitive hashing
scheme as a distribution on a family of hash functions $\mathcal{F}$
operating on a collection of objects, such that for any two
objects x, y,

$$\Pr_{h \in \mathcal{F}}[h(x) = h(y)] = sim(x, y) \qquad (1)$$

It is important to note that the probability in Eqn. [1] is for
a random selection of the hash function from the family $\mathcal{F}$.
Specifically, it is not for a random pair x, y - i.e. the equation
is valid for any pair of objects x and y. The output of the
hash functions may be either bits (0 or 1), or integers. Note 
that this definition of LSH, taken from [6], is geared towards 
similarity measures and is more useful in our context, as 
compared to the slightly different definition of LSH used 
by many other sources [7] [2], including the original LSH 
paper [11], which is geared towards distance measures.

Locality-sensitive hashing schemes have been proposed for 
a variety of similarity functions thus far, including Jaccard 
similarity [3] [TS], Cosine similarity [6] and kernelized
similarity functions (representing e.g. a learned similarity
metric) [12].

Candidate generation via LSH: 

One of the main reasons for the popularity of LSH is that it
can be used to construct an index that enables efficient candidate
generation for the similarity search problem. Such
LSH-based indices have been found to significantly outperform
more traditional indexing methods based on space partitioning
approaches, especially with increasing dimensions [11] [7].
The general method works as follows [11] [7] [5] [20] [10].
For each object in the dataset, we form l signatures,
where each signature is a concatenation of k hashes. All
pairs of objects that share at least one of the l signatures
are generated as candidate pairs. Retrieving each pair
of objects that share a signature can be done efficiently using
hash tables. For a given k and similarity threshold t, the
number of length-k signatures required for an expected false
negative rate $\epsilon$ can be shown to be [25]

$$l = \left\lceil \frac{\log \epsilon}{\log(1 - t^k)} \right\rceil$$
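The signature-based procedure above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `num_signatures` and `candidate_pairs` and the dictionary layout of `hashes` are assumptions made for the example, and the hash values are taken as already computed.

```python
import math
from collections import defaultdict
from itertools import combinations


def num_signatures(t, k, eps):
    """Smallest l such that (1 - t**k)**l <= eps, i.e. the formula above."""
    return math.ceil(math.log(eps) / math.log(1.0 - t ** k))


def candidate_pairs(hashes, k, l):
    """Generate all pairs sharing at least one of l length-k signatures.

    hashes maps an object id to a sequence of at least k*l hash values
    (hypothetical layout chosen for this sketch).
    """
    pairs = set()
    for j in range(l):
        buckets = defaultdict(list)
        for obj, h in hashes.items():
            # The j-th signature is the concatenation of k consecutive hashes.
            buckets[tuple(h[j * k:(j + 1) * k])].append(obj)
        for members in buckets.values():
            pairs.update(combinations(sorted(members), 2))
    return pairs
```

For instance, with t = 0.8, k = 4 and a false negative rate of 0.05, `num_signatures` gives l = 6, since 0.5904^6 is about 0.042 while 0.5904^5 is about 0.072.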

Candidate verification and similarity estimation: 

The similarity between the generated candidates can be com- 
puted in one of two ways: (a) by exact calculation of the 
similarity between each pair, or (b) using an estimate of 
the similarity, as the fraction of hashes that the two objects 
agree upon. The pairs of objects with estimated similarity 



greater than the threshold are finally output. In terms of 
running time, approach (b) is often faster, especially when 
the number of candidates is large and/or exact similarity 
calculations are expensive, such as with more complex sim- 
ilarity measures or with larger vector lengths. The main 
overhead with approach (b) is in hashing each point a sufficient
number of times in the first place, but this cost is
amortized over many similarity computations (especially in 
the case of all-pairs similarity search), and furthermore we 
need the hashes for candidate generation in any case. How- 
ever, what is less clear is how good this simple estimation 
procedure is in terms of accuracy, and whether it can be 
made any faster. We will address these questions next. 

3. CLASSICAL SIMILARITY ESTIMATION 
FOR LSH 

Similarity estimation for a candidate pair using LSH can 
be considered as a statistical parameter inference problem. 
The parameter we wish to infer is the similarity, and the 
data we observe is the outcome of the comparison of each 
successive hash between the candidate pair. The probability 
model relating the parameter to the data is given by the 
main LSH equation, Equation [1]. There are two main schools
of statistical inference - classical (frequentist) and Bayesian. 

Under classical (frequentist) statistical inference, the pa- 
rameters of a probability model are treated as fixed, and it 
is considered meaningless to make probabilistic statements 
about the parameters - hence the output of classical infer- 
ence is simply a point estimate, one for each parameter. 
The best known example of frequentist inference is maxi- 
mum likelihood estimation, where the value of the param- 
eter that maximizes the probability of the observed data is 
output as the point estimate. In the case of similarity esti- 
mation via LSH, let us say we have compared n hashes and 
have observed m agreements in hash values. The maximum
likelihood estimator $\hat{s}$ for the similarity is$^1$

$$\hat{s} = \frac{m}{n}$$

While previous researchers have not explicitly labeled their 
approaches as using the maximum likelihood estimators, 
they have implicitly used the above estimator, tuning the 
number of hashes n [20] [5]. However, this approach has
some important drawbacks, which we turn to next. 

3.1 Difficulty of tuning the number of hashes 

While the above estimator is unbiased, its variance is
$\frac{s(1 - s)}{n}$, meaning that the variance of the estimator depends
on the similarity s being estimated. This indicates that in
order to get the same level of accuracy for different similarities,
we will need to use different numbers of hashes.

We can be more precise and, for a given similarity, calculate
exactly the probability of a smaller-than-$\delta$ error in
$\hat{s}_n$, the similarity estimated using n hashes.

$$\Pr[|\hat{s}_n - s| < \delta] = \Pr[(s - \delta) n < m < (s + \delta) n] = \sum_{m = (s - \delta) n}^{(s + \delta) n} \binom{n}{m} s^m (1 - s)^{n - m}$$

Figure 1: Hashes vs. similarity



Using the above expression, we can calculate the minimum
number of hashes needed to ensure that the similarity estimate
is sufficiently concentrated, i.e. within $\delta$ of the true
value with probability $1 - \gamma$. A plot of the number of hashes
required for $\delta = \gamma = 0.05$ for various similarity values is
given in Figure [1]. As can be seen, there is a great difference
in the number of hashes required when the true similarities
are different; similarities closer to 0.5 require far
more hashes to estimate accurately than similarities close
to 0 or 1. A similarity of 0.5 needs 350 hashes for sufficient
accuracy, but a similarity of 0.95 needs only 16 hashes!
Stricter accuracy requirements lead to even greater
differences in the required number of hashes.

1 Proofs are elementary and are omitted.
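To make the dependence on the true similarity concrete, the binomial sum for the probability of a smaller-than-$\delta$ error can be evaluated directly. The sketch below is illustrative only; the function name `concentration_prob` is an assumption, and the interval endpoints are treated as inclusive with a small floating-point tolerance, which may differ from the rounding used to produce Figure 1.

```python
import math


def concentration_prob(s, n, delta):
    """Probability that m/n lies within delta of s when m ~ Binomial(n, s).

    Endpoints are included (with a small tolerance); this rounding
    convention is an assumption of the sketch.
    """
    lo = max(0, math.ceil((s - delta) * n - 1e-9))
    hi = min(n, math.floor((s + delta) * n + 1e-9))
    return sum(
        math.comb(n, m) * s ** m * (1 - s) ** (n - m)
        for m in range(lo, hi + 1)
    )
```

With the same n and the same $\delta$, the estimate for a pair with similarity 0.95 is considerably more concentrated than the estimate for a pair with similarity 0.5, which is the effect described above.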

Since we don't know the 
true similarity of each pair a 
priori, we cannot choose the 
right number of hashes be- 
forehand. If we err on the 
side of accuracy and choose 
a large n, then performance 
suffers since we will be com- 
paring many more hashes 

than are necessary for some candidate pairs. If, on the other 
hand, we err on the side of performance and choose a smaller 
n, then accuracy suffers. With standard similarity estima- 
tion, therefore, it is impossible to tune the number of hashes 
for the entire dataset so as to achieve both optimal perfor- 
mance and accuracy. 

3.2 Ignores the potential for early pruning 

In the context of similarity search with a user-specified 
threshold, the standard similarity estimation procedure also 
misses opportunities for early candidate pruning. The intu- 
ition here is best illustrated using an example: Let us say the 
similarity threshold is 0.8 i.e. the user is only interested in 
pairs with similarity greater than 0.8. Let us say the similarity
estimation is going to use n = 1000 hashes. But if we are
examining a candidate pair for which, out of the first 100 
hashes, only 10 hashes matched, then intuitively it seems 
very likely that this pair does not meet the threshold of 0.8. 
In general, it seems intuitively possible to be able to prune 
away many false positive candidates by looking only at the 
first few hashes, without needing to compare all the hashes. 
As we will see, most candidate generation algorithms produce
a significant number of false positives, and the standard
similarity estimation procedure using LSH does not exploit 
the potential for early pruning of candidate pairs. 
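The intuition in the example can be checked with a simple binomial tail computation. This is only a frequentist sanity check, not the Bayesian pruning rule developed in the next section; the function name `prob_at_most` is an assumption of the sketch.

```python
import math


def prob_at_most(m, n, s):
    """Probability of observing at most m matches in n hashes
    when every hash matches independently with probability s."""
    return sum(
        math.comb(n, j) * s ** j * (1 - s) ** (n - j)
        for j in range(m + 1)
    )


# A pair sitting exactly at the threshold 0.8 would almost never
# produce 10 or fewer matches in its first 100 hashes.
tail = prob_at_most(10, 100, 0.8)
```

Since this tail probability only decreases as the similarity grows above 0.8, a pair with so few early matches is extremely unlikely to meet the threshold.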

4. CANDIDATE PRUNING AND SIMILAR- 
ITY ESTIMATION USING BAYESLSH 

The key characteristic of Bayesian statistics is that it al- 
lows one to make probabilistic statements about any as- 
pect of the world, including things that would be consid- 
ered "fixed" under frequentist statistics and hence mean- 
ingless to make probabilistic statements about. In particu- 
lar, Bayesian statistics allows us to make probabilistic state- 
ments about the parameters of probability models - in other 
words, parameters are also treated as random variables. 
Bayesian inference generally consists of starting with a prior 
distribution over the parameters, and then computing a pos- 
terior distribution over the parameters, conditional on the 
data that we have actually observed, using Bayes' rule. A 



commonly cited drawback of Bayesian inference is the need 
for the prior probability distribution over the parameters, 
but a reasonable amount of data generally "swamps out" the 
influence of the prior (see Appendix). Furthermore, a good 
prior can often lead to improved estimates over maximum 
likelihood estimation - this is a common strategy for avoid- 
ing overfitting the data in machine learning and statistics. 
The big advantage of Bayesian inference in the context of 
similarity estimation is that instead of just outputting a point
estimate of the similarity, it gives us the complete posterior 
distribution of the similarity. In the rest of this section, we 
will avoid discussing specific choices for the prior distribu- 
tion and similarity measure in order to keep the discussion 
general. 

Fix attention on a particular pair (x, y), and let us say
that m out of the first n hashes match for this pair. We will
denote this event as M(m, n). The conditional probability
of the event M(m, n) given the similarity S (here S is a
random variable) is given by the binomial distribution with
n trials, where the success probability of each trial is S itself, from
Equation [1]. Note that we have already observed the event
M(m, n) happening i.e. m and n are not random variables,
they are the data.

$$\Pr[M(m, n) \mid S] = \binom{n}{m} S^m (1 - S)^{n - m} \qquad (2)$$



What we are interested in knowing is the probability distribution
of the similarity S, given that we already know that
m out of n hashes have matched. Using Bayes' rule, the
posterior distribution for S can be written as follows:$^2$

$$p(S \mid M(m, n)) = \frac{p(M(m, n) \mid S)\, p(S)}{p(M(m, n))} = \frac{p(M(m, n) \mid S)\, p(S)}{\int_0^1 p(M(m, n), s)\, ds} = \frac{p(M(m, n) \mid S)\, p(S)}{\int_0^1 p(M(m, n) \mid s)\, p(s)\, ds}$$

By plugging in the expressions for p(M(m, n) \ S) from Equa- 
tion [2] and a suitable prior distribution p(S), we can get, for 
every value of n and m, the posterior distribution of S con- 
ditional on the event M(m,n). We calculate the following 
quantities in terms of the posterior distribution: 

1. If after comparing n hashes, m hashes agree, what is
the probability that the similarity is greater than the
threshold t?

$$\Pr[S > t \mid M(m, n)] = \int_t^1 p(s \mid M(m, n))\, ds \qquad (3)$$

2. If after comparing n hashes, m hashes agree, what
is the maximum-a-posteriori estimate for the similarity
i.e. the similarity value with the highest posterior
probability? This will function as our estimate $\hat{S}$.

$$\hat{S} = \arg\max_s\, p(s \mid M(m, n)) \qquad (4)$$

3. Assume after comparing n hashes, m hashes agree,
and we have estimated the similarity to be $\hat{S}$ (e.g. as
indicated above). What is the concentration probability
of $\hat{S}$ i.e. the probability that this estimate is within $\delta$
of the true similarity?

$$\Pr[|S - \hat{S}| < \delta \mid M(m, n)] = \Pr[\hat{S} - \delta < S < \hat{S} + \delta \mid M(m, n)] \qquad (5)$$
$$= \int_{\hat{S} - \delta}^{\hat{S} + \delta} p(s \mid M(m, n))\, ds \qquad (6)$$

2 In terms of notation, we will use lower-case p(.) for probability
density functions of continuous random variables.
Pr[.] is used for probabilities of discrete events or discrete
random variables.
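Before committing to a specific prior, the three quantities above can be approximated numerically for any prior by discretising the similarity on a grid. The following is a rough sketch under that assumption; the function names and the grid size are hypothetical and chosen only for illustration.

```python
def posterior_on_grid(m, n, prior, grid_size=2001):
    """Discretised posterior p(s | M(m, n)) on an evenly spaced grid over [0, 1].

    prior is any callable giving an (unnormalised) prior density.
    The binomial coefficient cancels in the normalisation and is omitted.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    unnorm = [prior(s) * s ** m * (1 - s) ** (n - m) for s in grid]
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]


def prob_above(grid, weights, t):
    """Grid approximation of Equation 3: posterior mass at or above t."""
    return sum(w for s, w in zip(grid, weights) if s >= t)


def map_estimate(grid, weights):
    """Grid approximation of Equation 4: the grid point of highest posterior mass."""
    return max(zip(weights, grid))[1]


def concentration(grid, weights, s_hat, delta):
    """Grid approximation of Equation 6: posterior mass within delta of s_hat."""
    return sum(w for s, w in zip(grid, weights) if abs(s - s_hat) < delta)
```

For example, with a uniform prior (`prior=lambda s: 1.0`), m = 10 and n = 100, the posterior mass above a threshold of 0.8 is negligible and the MAP estimate falls at roughly 0.1.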



Assuming we can perform the above three kinds of inference, 
we design our algorithm, BayesLSH, so that it satisfies the 
probabilistic guarantees outlined in Section [1]. The algorithm
is outlined in Algorithm [1]. For each candidate pair
(x, y) we incrementally compare their respective hashes (line 
8, the parameter k indicates the number of hashes we will 
compare at a time), until either one of two events happens. 
The first possibility is that the candidate pair gets pruned 
away because the probability of it being a true positive pair 
has become very small (lines 10, 11 and 12), where we use 
Equation [3] to calculate this probability. The alternative 
possibility is that the candidate pair does not get pruned 
away, and we continue comparing hashes until our similar- 
ity estimate (line 14) becomes sufficiently concentrated that 
it passes our accuracy requirements (lines 15 and 16). Here 
we use Equation [5] to determine the probability that our estimate
is sufficiently accurate. Each such pair is added to
the output set of candidate pairs, along with our similarity 
estimate (lines 19 and 20). 

Our second algorithm, BayesLSH-Lite (see Algorithm [2]), is
a simpler version of BayesLSH, which calculates similarities
exactly. Since the similarity calculations are exact, there is
no need for parameters $\delta$, $\gamma$; however, this comes at the cost
of some intuitiveness, as there is a new parameter h specify- 
ing the maximum number of hashes that will be examined 
for each pair of objects. BayesLSH-Lite can be faster than 
BayesLSH for those datasets where exact similarity calcula- 
tions are cheap, e.g. because the object representations are 
simpler, such as binary, or if the average size of the objects 
is small. 

BayesLSH clearly overcomes the two drawbacks of standard
similarity estimation explained in Sections 3.1 and 3.2.
Any candidate pairs that can be pruned away by examining 
only the first few hashes will be pruned away by BayesLSH. 
As we will show later, this method is very effective for prun- 
ing away the vast majority of false positives. Secondly, the 
number of hashes for which each candidate pair is compared 
is determined automatically by the algorithm, depending on 
the user-specified accuracy requirements, completely elim- 
inating the need to manually set the number of hashes. 
Thirdly, each point in the dataset is only hashed as many 
times as is necessary. This will be particularly useful for 
applications where hashing a point itself can be costly e.g. 
for kernel LSH [12]. Also, outlying points which don't have
any points with whom their similarity exceeds the threshold 
need only be hashed a few times before BayesLSH prunes 
all candidate pairs involving such points away. 

In order to obtain a concrete instantiation of BayesLSH, 
we will need to specify three aspects: (i) the LSH family 
of hash functions, (ii) the choice of prior and (iii) how to 
tractably perform inference. Next, we will look at specific 
instantiations of BayesLSH for different similarity measures. 

4.1 BayesLSH for Jaccard similarity 

We will first discuss how BayesLSH can be used for ap- 
proximate similarity search for Jaccard similarity. 



Algorithm 1 BayesLSH

1: Input: Set of candidate pairs C; Similarity threshold t; recall
   parameter $\epsilon$; accuracy parameters $\delta$, $\gamma$
2: Output: Set O of pairs (x, y) along with similarity estimates $\hat{S}_{x,y}$
3: $O \leftarrow \emptyset$
4: for all $(x, y) \in C$ do
5:   $n, m \leftarrow 0$ {Initialization}
6:   isPruned $\leftarrow$ False
7:   while True do
8:     $m = m + \sum_{i=n+1}^{n+k} I[h_i(x) == h_i(y)]$ {Compare hashes n + 1 to n + k}
9:     n = n + k
10:    if $\Pr[S > t \mid M(m, n)] < \epsilon$ then
11:      isPruned $\leftarrow$ True
12:      break {Prune candidate pair}
13:    end if
14:    $\hat{S} \leftarrow \arg\max_s p(s \mid M(m, n))$
15:    if $\Pr[|\hat{S} - S| \geq \delta \mid M(m, n)] < \gamma$ then
16:      break {Similarity estimate is sufficiently concentrated}
17:    end if
18:  end while
19:  if isPruned == False then
20:    $O \leftarrow O \cup \{((x, y), \hat{S})\}$
21:  end if
22: end for
23: return O
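The control flow of Algorithm 1 can be sketched in Python as below. This is a hedged illustration rather than the authors' implementation: the three inference routines are passed in as callables (`prob_above`, `estimate`, `conc_prob`, all hypothetical names standing for Equations 3, 4 and 6), hashes are assumed to be precomputed lists, and a `max_hashes` cap is added because a finite hash array is assumed; a pair that reaches the cap is output with its current estimate.

```python
def bayes_lsh(candidates, hashes, k, eps, delta, gamma,
              prob_above, estimate, conc_prob, max_hashes):
    """Prune candidate pairs and estimate similarities, k hashes at a time.

    prob_above(m, n)            -> Pr[S > t | M(m, n)]           (Equation 3)
    estimate(m, n)              -> MAP estimate of the similarity (Equation 4)
    conc_prob(m, n, s_hat, d)   -> Pr[|S - s_hat| < d | M(m, n)]  (Equation 6)
    """
    output = []
    for x, y in candidates:
        n = m = 0
        pruned = False
        s_hat = None
        while n < max_hashes:  # assumption: only max_hashes hashes are stored
            hx, hy = hashes[x][n:n + k], hashes[y][n:n + k]
            m += sum(a == b for a, b in zip(hx, hy))
            n += k
            if prob_above(m, n) < eps:
                pruned = True  # unlikely to be a true positive
                break
            s_hat = estimate(m, n)
            if 1.0 - conc_prob(m, n, s_hat, delta) < gamma:
                break  # estimate is sufficiently concentrated
        if not pruned:
            output.append(((x, y), s_hat))
    return output
```

Keeping the inference routines as parameters mirrors the structure of this section: the loop is the same for every similarity measure, and only the three posterior computations change.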



Algorithm 2 BayesLSH-Lite

1: Input: Set of candidate pairs C; Similarity threshold t; recall
   parameter $\epsilon$; Number of hashes to use h
2: Output: Set O of pairs (x, y) along with exact similarities $s_{x,y}$
3: $O \leftarrow \emptyset$
4: for all $(x, y) \in C$ do
5:   $n, m \leftarrow 0$ {Initialization}
6:   isPruned $\leftarrow$ False
7:   while n < h do
8:     $m = m + \sum_{i=n+1}^{n+k} I[h_i(x) == h_i(y)]$ {Compare hashes n + 1 to n + k}
9:     n = n + k
10:    if $\Pr[S > t \mid M(m, n)] < \epsilon$ then
11:      isPruned $\leftarrow$ True
12:      break {Prune candidate pair}
13:    end if
14:  end while
15:  if isPruned == False then
16:    $s_{x,y}$ = similarity(x, y) {Exact similarity}
17:    if $s_{x,y} > t$ then
18:      $O \leftarrow O \cup \{((x, y), s_{x,y})\}$
19:    end if
20:  end if
21: end for
22: return O



LSH family: The LSH family for Jaccard similarity is the 
family of minwise independent permutations [4, ?] on the
universe from which our collection of sets is drawn. Each 
hash function returns the minimum element of the input set 
when the elements of the set are permuted as specified by 
the hash function (which itself is chosen at random from 
the family of minwise independent permutations) . The out- 
put of this family of hash functions, therefore, is an integer 
representing the minimum element of the permuted set. 
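As an illustration of this hash family, the sketch below draws explicit random permutations of a small integer universe and returns, for each permutation, the element of the set that comes first under it. Storing full permutations is only practical for small universes and is an assumption made for clarity; the function names are hypothetical.

```python
import random


def make_permutations(universe_size, num_hashes, seed=0):
    """Draw num_hashes random permutations of {0, ..., universe_size - 1}.

    perm[x] is the position of element x after permuting.
    """
    rng = random.Random(seed)
    perms = []
    for _ in range(num_hashes):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash_signature(items, perms):
    """For each permutation, return the element of items that is ranked first."""
    return [min(items, key=lambda x: perm[x]) for perm in perms]
```

The fraction of positions at which two signatures agree is then an estimate of the Jaccard similarity of the two sets, in line with Equation 1.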

Choice of prior: It is common practice in Bayesian infer- 
ence to choose priors from a family of distributions that is 
conjugate to the likelihood distribution, so that the inference 
is tractable and also that the posterior belongs to the same 
distribution family as the prior (indeed, that is the definition 
of a conjugate prior). The likelihood in this case is given by a 
binomial distribution, as indicated in Equation [2]. The conjugate
for the binomial is the Beta distribution, which has
two parameters $\alpha > 0$, $\beta > 0$ and is defined on the domain
(0, 1). The pdf for Beta($\alpha$, $\beta$) is defined as follows.

$$p(s) = \frac{s^{\alpha - 1} (1 - s)^{\beta - 1}}{B(\alpha, \beta)}$$

Here $B(\alpha, \beta)$ is the beta function, and it can also be thought
of as a normalization constant to ensure the entire distribution
integrates to 1.

Even assuming we want to model the prior using a Beta
distribution, how do we choose the parameters $\alpha$, $\beta$? A simple
choice is to set $\alpha = 1$, $\beta = 1$, which results in a uniform
distribution on (0, 1). However, we can actually learn $\alpha$, $\beta$ so
as to best fit a random sample of similarities from candidate
pairs output by the candidate generation algorithm. Let
us assume we have r samples chosen uniformly at random
from the total population of candidate pairs generated by
the particular candidate generation algorithm being used,
and their similarities are $s_1, s_2, \ldots, s_r$. Then we can estimate
$\alpha$, $\beta$ so as to best model the distribution of similarities
among candidate pairs. For the Beta distribution, a simple and
effective method of learning the parameters is via method-of-moments
estimation. In this method, we calculate the
sample moments (sample mean and sample variance), assume
that they are the true moments of the distribution
and solve for the parameter values that will result in the
obtained moments. In our case, we have the following estimates
for $\alpha$, $\beta$:

$$\hat{\alpha} = \bar{s} \left( \frac{\bar{s}(1 - \bar{s})}{\bar{s}_v} - 1 \right); \qquad \hat{\beta} = (1 - \bar{s}) \left( \frac{\bar{s}(1 - \bar{s})}{\bar{s}_v} - 1 \right)$$

where $\bar{s}$ and $\bar{s}_v$ are the sample mean and variance, given as
follows:

$$\bar{s} = \frac{\sum_{i=1}^{r} s_i}{r}; \qquad \bar{s}_v = \frac{\sum_{i=1}^{r} (s_i - \bar{s})^2}{r}$$
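The method-of-moments estimates translate directly into a few lines of code. The sketch below assumes the sample of candidate-pair similarities is available as a list of floats in (0, 1) with non-zero variance; the function name `fit_beta_moments` is hypothetical.

```python
def fit_beta_moments(sims):
    """Method-of-moments estimates (alpha, beta) for a Beta prior.

    sims: sampled similarities of candidate pairs (assumed non-constant).
    """
    r = len(sims)
    mean = sum(sims) / r
    var = sum((s - mean) ** 2 for s in sims) / r
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

As a check on the formulas, a sample whose mean is 0.25 and whose variance is 1/48 (the moments of a Beta(2, 6) distribution) yields the estimates 2 and 6.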

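Assuming a prior Beta($\alpha$, $\beta$) distribution on the similarity,
if we observe the event M(m, n), i.e. m out of the
first n hashes match, then the posterior distribution of the
similarity looks as follows:

$$p(s \mid M(m, n)) = \frac{\binom{n}{m} s^m (1 - s)^{n - m} \, \frac{s^{\alpha - 1} (1 - s)^{\beta - 1}}{B(\alpha, \beta)}}{\int_0^1 \binom{n}{m} s^m (1 - s)^{n - m} \, \frac{s^{\alpha - 1} (1 - s)^{\beta - 1}}{B(\alpha, \beta)} \, ds} = \frac{s^{m + \alpha - 1} (1 - s)^{n - m + \beta - 1}}{B(m + \alpha, n - m + \beta)}$$

Hence, the posterior distribution of the similarity also follows
a Beta distribution with parameters $m + \alpha$ and $n - m + \beta$.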

Inference: We next show concrete ways to perform inference,
i.e. computing Equations [3], [4] and [6].

The probability that similarity is greater than the threshold
after observing that m out of the first n hashes match
is:

$$\Pr[S > t \mid M(m, n)] = \int_t^1 p(s \mid M(m, n))\, ds = 1 - I_t(m + \alpha, n - m + \beta)$$

Above, $I_t(., .)$ refers to the regularized incomplete beta function,
which gives the cdf for the beta distribution. This function
is available in standard scientific computing libraries,
where it is typically approximated using continued fractions [?].

Our similarity estimate, after observing m matches in
n hashes, will be the mode of the posterior distribution
$p(s \mid M(m, n))$. The mode of Beta($\alpha$, $\beta$) is given by $\frac{\alpha - 1}{\alpha + \beta - 2}$.
Therefore, our similarity estimate after observing that m
out of the first n hashes agree is $\hat{S} = \frac{m + \alpha - 1}{n + \alpha + \beta - 2}$.

The concentration probability of the similarity estimate
$\hat{S}$ can be derived as follows (the expression for $\hat{S}$ indicated
above can be substituted in the below equations):

$$\Pr[|\hat{S} - S| < \delta \mid M(m, n)] = \int_{\hat{S} - \delta}^{\hat{S} + \delta} p(s \mid M(m, n))\, ds = I_{\hat{S} + \delta}(m + \alpha, n - m + \beta) - I_{\hat{S} - \delta}(m + \alpha, n - m + \beta)$$



Thus, by substituting the above computations in the corresponding
places in Algorithm [1], we obtain a version of
BayesLSH specifically adapted to Jaccard similarity.
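The three Jaccard inference steps can be put together as in the following sketch. To keep the example dependency-free, the regularized incomplete beta function is approximated by a simple midpoint rule rather than the continued-fraction routines found in scientific libraries; this substitution, the function names, and the restriction to posterior parameters of at least 1 are all assumptions of the sketch.

```python
import math


def beta_cdf(x, a, b, steps=20000):
    """Midpoint-rule approximation of the regularized incomplete beta
    function I_x(a, b). Assumes a >= 1 and b >= 1 so the integrand is bounded."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += math.exp(log_norm + (a - 1) * math.log(s) + (b - 1) * math.log(1 - s))
    return total * h


def jaccard_inference(m, n, t, delta, alpha=1.0, beta=1.0):
    """Return (Pr[S > t | M], MAP estimate, concentration probability)
    for a Beta(alpha, beta) prior after m matches in n hashes."""
    a, b = m + alpha, n - m + beta
    p_above = 1.0 - beta_cdf(t, a, b)
    s_hat = (a - 1.0) / (a + b - 2.0)  # posterior mode; needs a + b > 2
    conc = beta_cdf(s_hat + delta, a, b) - beta_cdf(s_hat - delta, a, b)
    return p_above, s_hat, conc
```

For instance, after 32 matches in 32 hashes with a uniform prior, the estimate is 1.0 and the concentration probability for $\delta = 0.05$ is $1 - 0.95^{33}$, roughly 0.82, so more hashes would still be needed to meet $\gamma = 0.05$.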

4.2 BayesLSH for Cosine similarity 

We will next discuss instantiating BayesLSH for Cosine 
similarity. 

LSH family: For Cosine similarity, each hash function $h_i$
is associated with a random vector $r_i$, each of whose components
is a sample from the standard gaussian ($\mu = 0$, $\sigma = 1$).
For a vector x, $h_i(x) = 1$ if $dot(r_i, x) \geq 0$ and $h_i(x) = 0$
otherwise [6]. Note that each hash function outputs a bit, and
hence these hashes can be stored with less space.

However, there is one challenge here that needs to be
overcome that was absent for BayesLSH with Jaccard similarity:
this LSH family is for a slightly different similarity
measure than cosine - it is instead for $1 - \frac{\theta(x, y)}{\pi}$, where
$\theta(x, y) = \arccos\left( \frac{x \cdot y}{\|x\| \|y\|} \right)$. For notational ease, we will refer
to this similarity function as r(x, y) i.e. $r(x, y) = 1 - \frac{\theta(x, y)}{\pi}$.
Explicitly,

$$\Pr[h_i(x) = h_i(y)] = r(x, y)$$

$$\Pr[M(m, n) \mid r] = \binom{n}{m} r^m (1 - r)^{n - m}$$
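A small sketch of this hash family, together with the similarity r(x, y) that it actually estimates, is given below. It uses plain Python lists for vectors; the function names are hypothetical and the code is meant only to illustrate the collision probability stated above.

```python
import math
import random


def make_hyperplanes(dim, num_hashes, seed=0):
    """Draw num_hashes random vectors with standard gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]


def cosine_hash_bits(x, planes):
    """One bit per random vector: 1 if the dot product is non-negative, else 0."""
    return [1 if sum(ri * xi for ri, xi in zip(r, x)) >= 0 else 0 for r in planes]


def r_similarity(x, y):
    """r(x, y) = 1 - theta(x, y) / pi, the collision probability of this family."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return 1.0 - math.acos(cos) / math.pi
```

For two vectors at an angle of 45 degrees, r(x, y) = 0.75, and the fraction of agreeing bits approaches this value as the number of hashes grows.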



Since the similarity function we are interested in is cos(x, y)
and not r(x, y) - in particular, we wish for probabilistic guarantees
on the quality of the output in terms of cos(x, y) and
not r(x, y) - we will need to somehow express the posterior
probability in terms of s = cos(x, y). One can choose to
re-express the likelihood in terms of s = cos(x, y) instead
of in terms of r, but this introduces cos() terms into the
likelihood, and makes it very hard to find a suitable prior
that keeps the inference tractable. Instead we compute the
posterior distribution of r, which we transform appropriately
into a posterior distribution of s.

Choice of prior: We will need to choose a prior distribution for $r$. Previously, we used a Beta prior for Jaccard BayesLSH; unfortunately $r$ has range [0.5, 1], while the standard Beta distribution has support on the domain (0, 1). We can still map the standard Beta distribution onto the domain (0.5, 1), but this distribution will no longer be conjugate to the binomial likelihood (see the footnote on the Beta distribution supported on (0.5, 1)). Our solution is to use a simple uniform distribution on [0.5, 1] as the prior for $r$. Even when the true similarity distribution is very far from being uniform (as is the case in real datasets, including the ones used in our experiments), this prior still works well because the posterior is strongly influenced by the actual outcomes observed (see Appendix).
The prior pdf therefore is:

$$p(r) = \frac{1}{1 - 0.5} = 2$$

The posterior pdf, after observing that $m$ out of the first $n$ hashes agree, is:

$$p(r \mid M(m,n)) = \frac{2 \, r^m (1-r)^{n-m}}{\int_{0.5}^{1} 2 \, r^m (1-r)^{n-m} \, dr} = \frac{r^m (1-r)^{n-m}}{\int_{0.5}^{1} r^m (1-r)^{n-m} \, dr} = \frac{r^m (1-r)^{n-m}}{B_1(m+1, n-m+1) - B_{0.5}(m+1, n-m+1)}$$

Here $B_x(a,b)$ is the incomplete Beta function, defined as $B_x(a,b) = \int_0^x y^{a-1} (1-y)^{b-1} \, dy$.

Inference: In order to calculate Equations [3], [4] and [6], we will first need a way to convert from $r$ to $s$ and vice-versa. Let $r2c : [0.5, 1] \rightarrow [0, 1]$ be the 1-to-1 function that maps from $r(x,y)$ to $\cos(x,y)$; $r2c()$ is given by $r2c(r) = \cos(\pi (1 - r))$. Similarly, let $c2r$ be the 1-to-1 function that does the same map in reverse; $c2r()$ is given by $c2r(c) = 1 - \frac{\arccos(c)}{\pi}$.
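The two conversions can be written down directly. The following is a small sketch in Python; the function names mirror the text, but the code itself is an illustration and not taken from the authors' implementation.

```python
import math


def r2c(r):
    """Map the hash-agreement similarity r in [0.5, 1] to cosine in [0, 1]."""
    return math.cos(math.pi * (1.0 - r))


def c2r(c):
    """Map cosine similarity c in [0, 1] to r = 1 - arccos(c) / pi."""
    return 1.0 - math.acos(c) / math.pi
```

For instance, `c2r(0.7)` is roughly 0.747, which is in line with the value of about 0.75 quoted for a cosine similarity of 0.70 in the Appendix.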



Let $R$ be the random variable such that $R = c2r(S)$ and let $t_r = c2r(t)$. After observing that $m$ out of the first $n$ hashes agree, the probability that cosine similarity is greater than the threshold $t$ is:

$$\begin{aligned}
\Pr[S \geq t \mid M(m,n)] &= \Pr[c2r(S) \geq c2r(t) \mid M(m,n)] \\
&= \Pr[R \geq t_r \mid M(m,n)] \\
&= \int_{t_r}^{1} p(r \mid M(m,n)) \, dr \\
&= \frac{\int_{t_r}^{1} r^m (1-r)^{n-m} \, dr}{B_1(m+1, n-m+1) - B_{0.5}(m+1, n-m+1)} \\
&= \frac{B_1(m+1, n-m+1) - B_{t_r}(m+1, n-m+1)}{B_1(m+1, n-m+1) - B_{0.5}(m+1, n-m+1)}
\end{aligned}$$


Footnote (Beta prior on (0.5, 1)): The pdf of a Beta distribution supported only on (0.5, 1) with parameters $\alpha, \beta$ is $p(x) \propto (x - 0.5)^{\alpha - 1} (1 - x)^{\beta - 1}$. With a binomial likelihood, the posterior pdf takes the form $p(x \mid M(m,n)) \propto x^m (x - 0.5)^{\alpha - 1} (1 - x)^{n - m + \beta - 1}$. Unfortunately there is no simple and fast way to integrate this pdf.


The first step in the above derivation follows because c2r() is a 1-to-1 (and monotonically increasing) mapping, so the inequality is preserved. Thus, we have a concrete expression for calculating Equation [3].
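As an illustration, the pruning probability can be evaluated without a special-function library. The sketch below relies on a standard identity for integer parameters, $B_1(a,b) - B_x(a,b) = B(a,b) \cdot \Pr[\mathrm{Binomial}(a+b-1, x) \leq a-1]$, so that the complete Beta factor cancels in the ratio. The helper names (`binom_cdf`, `prob_similarity_above`) and the underflow guard are assumptions made for this example, not part of the paper.

```python
import math


def c2r(c):
    """Map cosine similarity to r = 1 - arccos(c) / pi."""
    return 1.0 - math.acos(c) / math.pi


def binom_cdf(k, trials, p):
    """P[Binomial(trials, p) <= k], with each term evaluated in log-space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= trials else 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(min(k, trials) + 1):
        log_term = (math.lgamma(trials + 1) - math.lgamma(j + 1)
                    - math.lgamma(trials - j + 1)
                    + j * log_p + (trials - j) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def prob_similarity_above(m, n, t):
    """Pr[S >= t | M(m, n)] under the uniform prior on r in [0.5, 1]."""
    if t <= 0.0:
        return 1.0                      # every cosine in [0, 1] qualifies
    numerator = binom_cdf(m, n + 1, c2r(t))      # proportional to B_1 - B_{t_r}
    denominator = binom_cdf(m, n + 1, 0.5)       # proportional to B_1 - B_{0.5}
    if denominator == 0.0:
        return 0.0                      # guard: normaliser underflowed (m far below n/2)
    return min(numerator / denominator, 1.0)
```

With 32 agreements out of 32 hashes and $t = 0.7$ the probability is close to 1, whereas with 16 agreements out of 32 it is close to 0, so such a pair would be pruned for any small $\epsilon$.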

Next, we need an expression for the similarity estimate $\hat{S}$, given that $m$ out of $n$ hashes have matched so far. Let $\hat{R} = \arg\max_r \, p(r \mid M(m,n))$. We can obtain a closed form expression for $\hat{R}$ by solving $\frac{\partial p(r \mid M(m,n))}{\partial r} = 0$; when we do this, we get $r = \frac{m}{n}$. Hence, $\hat{R} = \frac{m}{n}$. Now $\hat{S} = r2c(\hat{R})$, therefore $\hat{S} = r2c\left(\frac{m}{n}\right)$. This is our expression for calculating Equation [4].

Next, let us consider the concentration probability of $\hat{S}$.

$$\begin{aligned}
\Pr[|\hat{S} - S| < \delta \mid M(m,n)] &= \Pr[\hat{S} - \delta < S < \hat{S} + \delta \mid M(m,n)] \\
&= \Pr[c2r(\hat{S} - \delta) < c2r(S) < c2r(\hat{S} + \delta) \mid M(m,n)] \\
&= \Pr[c2r(\hat{S} - \delta) < R < c2r(\hat{S} + \delta) \mid M(m,n)] \\
&= \frac{\int_{c2r(\hat{S} - \delta)}^{c2r(\hat{S} + \delta)} r^m (1-r)^{n-m} \, dr}{B_1(m+1, n-m+1) - B_{0.5}(m+1, n-m+1)} \\
&= \frac{B_{c2r(\hat{S} + \delta)}(m+1, n-m+1) - B_{c2r(\hat{S} - \delta)}(m+1, n-m+1)}{B_1(m+1, n-m+1) - B_{0.5}(m+1, n-m+1)}
\end{aligned}$$

Thus, we have concrete expressions for Equations [3], [4] and [6], giving us an instantiation of BayesLSH adapted to Cosine similarity.
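The estimate and its concentration probability can be sketched in the same spirit. The code below is an illustration under stated assumptions: the helper names are hypothetical, $m/n$ is clamped to the prior's support [0.5, 1], $\hat{S} \pm \delta$ is clipped to the valid cosine range [0, 1], and the Beta-function ratio is evaluated through the integer-parameter identity $B_1(a,b) - B_x(a,b) = B(a,b) \cdot \Pr[\mathrm{Binomial}(a+b-1, x) \leq a-1]$. None of these details is specified in the text.

```python
import math


def r2c(r):
    return math.cos(math.pi * (1.0 - r))


def c2r(c):
    return 1.0 - math.acos(c) / math.pi


def binom_cdf(k, trials, p):
    """P[Binomial(trials, p) <= k], with each term evaluated in log-space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= trials else 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    return sum(
        math.exp(math.lgamma(trials + 1) - math.lgamma(j + 1)
                 - math.lgamma(trials - j + 1)
                 + j * log_p + (trials - j) * log_q)
        for j in range(min(k, trials) + 1))


def estimate_cosine(m, n):
    """S_hat = r2c(m / n); clamping m / n to at least 0.5 is an added assumption."""
    return r2c(max(m / n, 0.5))


def prob_concentrated(m, n, delta):
    """Pr[|S_hat - S| < delta | M(m, n)] under the uniform prior on r."""
    s_hat = estimate_cosine(m, n)
    lo = c2r(max(s_hat - delta, 0.0))    # clip to the valid cosine range
    hi = c2r(min(s_hat + delta, 1.0))
    total = binom_cdf(m, n + 1, 0.5)     # proportional to B_1 - B_{0.5}
    if total == 0.0:
        raise ValueError("normaliser underflowed; such a pair would be pruned")
    mass = binom_cdf(m, n + 1, lo) - binom_cdf(m, n + 1, hi)
    return max(0.0, min(mass / total, 1.0))
```

With an agreement rate of 0.75 and $\delta = 0.05$, the concentration probability is low after 32 hashes and much higher after 1024 hashes, which is the behaviour that lets the number of hashes adapt to the requested accuracy.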

4.3 Optimizations 

The basic BayesLSH can be optimized without affecting the correctness of the algorithm in a few ways. The main idea behind the optimizations here is to minimize the number of times inference has to be performed, in particular the evaluation of Equations [3] and [6].

Pre-computation of minimum matches: We pre-compute the minimum number of matches a candidate pair needs to have in order for $\Pr[S \geq t \mid M(m,n)] \geq \epsilon$ to be true, thus completely eliminating the need for any online inference in line 10 of Algorithm [1]. For every value of $n$ that we will consider (up to some maximum), we pre-compute the function minMatches(n) defined as follows:

$$\mathrm{minMatches}(n) = \arg\min_{m} \; \Pr[S \geq t \mid M(m,n)] \geq \epsilon$$

This can be done via binary search, since $\Pr[S \geq t \mid M(m,n)]$ increases monotonically with $m$ for a fixed $n$. Now, for each candidate pair, we simply check if the actual number of matches for that pair at every $n$ is at least minMatches(n). Note that we will not encounter every possible value of $n$ up to the maximum - instead, since we compare $k$ hashes at a time, we need to compute minMatches() only once for all multiples of $k$ up to the maximum.
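The pre-computation can be sketched as a binary search over $m$ for each multiple of $k$. In the sketch below, `prob_above` is a compact stand-in for Equation [3] in the cosine case (uniform prior on $r$), written with `math.comb` and therefore only suitable for $n$ up to roughly a thousand; the function names and the `n + 1` sentinel are assumptions for this example.

```python
import math


def prob_above(m, n, t):
    """Pr[S >= t | M(m, n)] for cosine similarity with a uniform prior on r."""
    t_r = 1.0 - math.acos(t) / math.pi

    def cdf(p):  # P[Binomial(n + 1, p) <= m]
        return sum(math.comb(n + 1, j) * p**j * (1.0 - p)**(n + 1 - j)
                   for j in range(m + 1))

    return cdf(t_r) / cdf(0.5)


def min_matches(n, t, eps, prob_fn=prob_above):
    """Smallest m with prob_fn(m, n, t) >= eps, or n + 1 if no m qualifies."""
    if prob_fn(n, n, t) < eps:
        return n + 1
    lo, hi = 0, n
    while lo < hi:                      # invariant: the answer lies in [lo, hi]
        mid = (lo + hi) // 2
        if prob_fn(mid, n, t) >= eps:   # valid because prob_fn is monotone in m
            hi = mid
        else:
            lo = mid + 1
    return lo


def precompute_min_matches(max_hashes, k, t, eps):
    """minMatches(n) for n = k, 2k, ... up to max_hashes."""
    return {n: min_matches(n, t, eps) for n in range(k, max_hashes + 1, k)}
```

At query time a candidate pair with `m` matches after `n` hashes is then pruned whenever `m < table[n]`, with no inference performed online.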

Cache results of inference: We maintain a cache indexed by $(m, n)$ that indicates whether or not the similarity estimate that is obtained after $m$ hashes out of $n$ agree is sufficiently concentrated (Equation [6]). Note that for each possible $n$, we only need to cache the results for $m \geq$ minMatches(n), since lower values of $m$ are guaranteed to result in pruning. Thus, in the vast majority of cases, we can simply fetch the result of the inference from the cache instead of having to perform it afresh.
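One possible shape for such a cache is a dictionary keyed by $(m, n)$ that wraps whatever concentration test is in use. The class name, the callable `is_concentrated`, and the error raised for already-pruned pairs are all hypothetical choices made for this sketch.

```python
class ConcentrationCache:
    """Memoise a yes/no concentration test keyed by (m, n)."""

    def __init__(self, is_concentrated, min_matches):
        self._is_concentrated = is_concentrated   # callable (m, n) -> bool
        self._min_matches = min_matches           # dict: n -> minMatches(n)
        self._results = {}

    def lookup(self, m, n):
        """Return the cached decision, running the inference on a miss only."""
        if m < self._min_matches[n]:
            raise ValueError("pair should already have been pruned")
        key = (m, n)
        if key not in self._results:
            self._results[key] = self._is_concentrated(m, n)
        return self._results[key]

    def __len__(self):
        return len(self._results)
```

Because $n$ only takes values that are multiples of $k$ and $m$ is bounded below by minMatches(n), the number of distinct keys stays small relative to the number of candidate pairs.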

Cheaper storage of hash functions: For cosine similarity, storing the random gaussian vectors corresponding to each hash function can take up a fair amount of space. To reduce this storage requirement, we developed a scheme for storing each float using only 2 bytes, by exploiting the fact that random gaussian samples from the standard 0-mean, 1-standard deviation gaussian lie well within a small interval around 0. Let us assume that all of our samples will lie within the interval (-8, 8) (it is astronomically unlikely that a sample from the standard gaussian lies outside this interval). Any float $x \in (-8, 8)$ can be represented as a 2-byte integer $x' = \left\lfloor (x + 8) \cdot \frac{2^{16}}{16} \right\rfloor$. The maximum error of this scheme is 0.0001 for any real number in (-8, 8).
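A small sketch of this quantization follows. The encoding is the floor formula from the text; decoding to the midpoint of each bin, the clamp to 65535, and the function names are assumptions of this example, since the text only describes the encoding direction.

```python
import math
import struct

SCALE = 2**16 / 16          # 4096 codes per unit on the interval (-8, 8)


def encode(x):
    """Quantise a float in (-8, 8) to an unsigned 16-bit integer code."""
    if not -8.0 < x < 8.0:
        raise ValueError("sample outside the assumed interval (-8, 8)")
    return min(math.floor((x + 8.0) * SCALE), 2**16 - 1)


def decode(code):
    """Recover an approximate float; midpoint decoding is an assumption."""
    return (code + 0.5) / SCALE - 8.0


def to_bytes(x):
    """Pack the code into exactly 2 bytes (little-endian unsigned short)."""
    return struct.pack("<H", encode(x))
```

With midpoint decoding the round-trip error is at most half a bin, i.e. 1/8192 or about 0.00012, which is of the same order as the 0.0001 quoted above.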

5. EXPERIMENTS 

We experimentally evaluated the performance of BayesLSH and BayesLSH-Lite on 6 real datasets with widely varying characteristics (see Table [1]).

• RCV1 is a text corpus of Reuters articles and is a popular benchmarking corpus for text-categorization research [14]. We use the standard pre-processed version of the dataset with word stemming and tf-idf weighting.

• Wiki datasets. We pre-processed the article dump of the English Wikipedia (see footnote 4) - Sep 2010 version - to produce both a text corpus of Wiki articles as well as the directed graph of hyperlinks between Wiki articles. Our pre-processing includes the removal of stop-words, removal of insignificant articles, and tf-idf weighting (for both the text and the graph). Words occurring at least 20 times in the entire corpus are used as features, resulting in a dimensionality of 344,352. The WikiWords100K dataset consists of text vectors with at least 500 non-zero features, of which there are 100,528. The WikiWords500K dataset consists of vectors with at least 200 non-zero features, of which there are 494,244. The WikiLinks dataset consists of the entire article-article graph among ~1.8M articles, with Tf-idf weighting.

• Orkut consists of a subset of the (undirected) friendship network among nearly 3M Orkut users, made available by [19]. Each user is represented as a weighted vector of their friends, with Tf-idf weighting.

• Twitter consists of the directed graph of follower/followee relationships among the subset of Twitter users with at least 1,000 followers, first collected by Kwak et al. [13]. Each user is represented as a weighted vector of the users they follow, with Tf-idf weighting.

We note that all our datasets represent realistic applications for all-pairs similarity search. Similarity search on text corpora can be useful for clustering, semi-supervised learning, near-duplicate detection etc., while similarity search on the graph datasets can be useful for link prediction, friendship recommendation and clustering. Also, in our experiments we primarily focus on similarity search for general real-valued vectors using Cosine similarity, as opposed to similarity search for binary vectors (i.e. sets). Our reasons are as follows:

1. Representations of objects as general real-valued vectors are generally more powerful and lead to better similarity assessments, Tf-idf style representations being the classic example here (see [21] for another example from graph mining).

2. Similarity search is generally harder on real-valued vectors. With binary vectors (sets), most similarity measures are directly proportional to the overlap between the two sets, and it is easier to obtain bounds on the overlap between two sets by inspecting only a few elements of each set, since each element in the set can only contribute the same, fixed number (1) to the overlap. On the other hand, with general real-valued vectors, different elements/features have different weights (also, the same feature may have different weights across different vectors), meaning that it is harder to bound the similarity by inspecting only a few elements of the vector.

4 http://download.wikimedia.org

Dataset | Vectors | Dimensions | Avg. len | Nnz
RCV1 | 804,414 | 47,236 | 76 | 61e6
WikiWords100K | 100,528 | 344,352 | 786 | 79e6
WikiWords500K | 494,244 | 344,352 | 398 | 196e6
WikiLinks | 1,815,914 | 1,815,914 | 24 | 44e6
Orkut | 3,072,626 | 3,072,626 | 76 | 233e6
Twitter | 146,170 | 146,170 | 1369 | 200e6

Table 1: Dataset details. Nnz stands for number of non-zeros.

5.1 Experimental setup 

We compare the following methods for all-pairs similarity 
search. 

1. AllPairs [3] (AP) is one of the state-of-the-art approaches for all-pairs similarity search, especially for cosine similarity on real-valued vectors. AllPairs is an exact algorithm.

2,3. AP+BayesLSH, AP+BayesLSH-Lite: These are variants of BayesLSH and BayesLSH-Lite where the input is the candidate set generated by AllPairs.

4,5. LSH, LSH Approx: These are two variants of the standard LSH approach for all-pairs similarity search. For both LSH and LSH Approx, candidate pairs are generated as described in Section [2]; for LSH, similarities are calculated exactly, whereas for LSH Approx, similarities are instead estimated using the standard maximum likelihood estimator, as described in Section [3]. For LSH Approx, we tuned the number of hashes and set it to 2048 for cosine similarity and 360 for Jaccard similarity. Note that the hashes for Cosine similarity are only bits, while the hashes for Jaccard are integers.

6,7. LSH+BayesLSH, LSH+BayesLSH-Lite: These are variants of BayesLSH that take as input the candidate set generated by LSH as described in Section [2].

8. PPJoin+ [24] is a state-of-the-art exact algorithm for all-pairs similarity search; however, it only works for binary vectors, so we only include it in the experiments with Jaccard and binary cosine similarity.

For all BayesLSH variants, we report the full execution time, i.e. including the time for candidate generation. For BayesLSH variants, $\epsilon = \gamma = 0.03$ and $\delta = 0.05$ ($\gamma$, $\delta$ don't apply to BayesLSH-Lite). For the number of hashes to be compared at a time, $k$, it makes sense to set this to be a multiple of the word size, since for cosine similarity, each hash is simply a bit. We set $k = 32$, although higher multiples of the word size work well too. In the case of BayesLSH-Lite, the number of hashes to be used for pruning was set to $h = 128$ for Cosine and $h = 64$ for Jaccard. For LSH and LSH Approx, the expected false negative rate is set to 0.03. The randomized algorithms (LSH variants, BayesLSH variants) were each run 3 times and the average results are reported.
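To illustrate why a multiple of the word size is a natural choice for $k$, the sketch below packs bit hashes into 32-bit words and counts agreements one word at a time with an XOR followed by a population count. The function names are hypothetical, and the sketch assumes the number of hash bits is a multiple of the word size (otherwise the zero padding in the last word would be counted as agreements).

```python
def pack_bits(bits, word_size=32):
    """Pack a list of 0/1 hash bits into integers of word_size bits each."""
    words = []
    for start in range(0, len(bits), word_size):
        word = 0
        for offset, bit in enumerate(bits[start:start + word_size]):
            word |= (bit & 1) << offset
        words.append(word)
    return words


def matches_in_word(word_x, word_y, word_size=32):
    """Number of agreeing hash bits in one word: word_size minus popcount(XOR)."""
    differing = bin((word_x ^ word_y) & ((1 << word_size) - 1)).count("1")
    return word_size - differing


def incremental_matches(words_x, words_y, word_size=32):
    """Yield (n, m) after each word: n hashes compared so far, m agreements."""
    m = 0
    for i, (wx, wy) in enumerate(zip(words_x, words_y), start=1):
        m += matches_in_word(wx, wy, word_size)
        yield i * word_size, m
```

Each yielded `(n, m)` pair is exactly the evidence $M(m, n)$ on which the pruning and concentration tests operate, so a pair can be abandoned as soon as one of those tests fires.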

All of the methods work for both Cosine and Jaccard similarities, for both real-valued as well as binary vectors, except for PPJoin+, which only works for binary vectors. The code for PPJoin+ was downloaded from the authors' website; all the other methods were implemented by us (see the footnote on our AllPairs implementation). All algorithms are single-threaded and are implemented in C/C++. The experiments were run by submitting jobs to a cluster, where each node on the cluster runs on a dual-socket, dual-core 2.3 GHz Opteron with 8GB RAM. Each algorithm was allowed 50 hrs (180K secs) before it was declared timed out and killed.

We executed the different algorithms on both the weighted and binary versions of the datasets, using Cosine similarity for the weighted case and both Jaccard and Cosine for the binary case. For Cosine similarity, we varied the similarity threshold from 0.5 to 0.9, but for Jaccard we found that very few pairs satisfied higher similarity thresholds (e.g. for Orkut, a 3M record dataset, only 1648 pairs were returned at threshold 0.9), and hence varied the threshold from 0.3 to 0.7. For Jaccard and Binary Cosine, we only report results on WikiWords500K, Orkut and Twitter, which are our three largest datasets in terms of total number of non-zeros.

5.2 Results comparing BayesLSH variants with baselines

Figure [3] shows a comparison of timing results for all algorithms across a variety of datasets and thresholds. Table [2] compares the fastest BayesLSH variant with all the baselines. The quality of the output of BayesLSH can be seen in Table [3], where we show the recall rates for AP+BayesLSH and AP+BayesLSH-Lite, and in Table [4], where we compare the accuracies of LSH and LSH+BayesLSH. The recall and accuracies of the other BayesLSH variants follow similar trends and are omitted. The main trends from the results are distilled and discussed below:

1. BayesLSH and BayesLSH-Lite improve the running time of both AllPairs and LSH in almost all the cases, with speedups usually in the range 2x-20x. It can be seen from Table [2] that a BayesLSH variant is the fastest algorithm (in terms of total time across all thresholds) for the majority of datasets and similarities, with the exception of Orkut for Jaccard and binary cosine. Furthermore, the quality of BayesLSH output is high; the recall rates are usually above 97% (see Table [3]), and similarity estimates are accurate, with usually no more than 5% output pairs with error above 0.05 (see Table [4]).

2. BayesLSH is fast primarily by being able to prune away the vast majority of false positives after comparing only a few hashes. This is illustrated in Figure [4]. For WikiWords100K at a threshold of 0.7 (see Figure 4(a)), AllPairs supplies BayesLSH with nearly 5e09 candidates, while the result set only has 2.2e05. BayesLSH is able to prune away 4.0e+09 (80%) of the input candidate pairs after examining only 32 hashes - in this case, each hash is a bit, so BayesLSH compared only 4 bytes worth of hashes between each pair. By the time BayesLSH has compared 128 hashes (16 bytes) there are only 1.0e06 candidates remaining. Similarly LSH supplies BayesLSH with 6.0e08 candidates - better than AllPairs, but nonetheless orders of magnitude larger than the final result set - and after comparing 128 hashes (16 bytes), BayesLSH is able to prune that down to only 7.4e05, only about 3.5x larger than the result set. On the WikiLinks dataset (see Figure 4(b)), we see a similar trend with the roles of AllPairs and LSH reversed - this time it is AllPairs instead which supplies BayesLSH with fewer candidates. After examining only 128 hashes, BayesLSH is able to reduce the number of candidates from 1.3e09 down to 1.2e07 for AllPairs, and from 1.8e11 down to 5.1e07 for LSH. Figure 4(c) shows a similar trend, this time on the binary version of WikiWords100K.

Footnote (AllPairs implementation): Our AllPairs implementation is slightly faster than the original implementation of the authors due to a simple implementational fix. This has since been incorporated into the authors' implementation.

3. We note that BayesLSH and BayesLSH-Lite often (but not always) have comparable speeds, since most of the speed benefit is coming from the ability of BayesLSH to prune, which is an aspect that is common to both algorithms. The difference between the two is mainly in terms of the hashing overhead. BayesLSH needs to obtain many more hashes of each object in order for similarity estimation; this cost is amortized at lower thresholds, where many more similarity calculations need to be performed. BayesLSH-Lite is faster at higher thresholds or when exact similarity calculations are cheaper, such as datasets with low average vector length.

4. AllPairs and LSH have complementary strengths and weaknesses. On the datasets RCV1, WikiWords100K, WikiWords500K and Twitter (see Figures 3(a)-3(c), 3(f)), LSH is clearly the faster algorithm than AllPairs (in the case of WikiWords500K, AllPairs did not finish execution even for the highest threshold of 0.9). On the other hand, AllPairs is the much faster algorithm on WikiLinks and Orkut (see Figures 3(d), 3(e)), with LSH timing out in most cases. Looking at the characteristics of the datasets, one can discern a pattern: AllPairs is faster on datasets with smaller average length and greater variance in the vector lengths, as is the case with the graph datasets WikiLinks and Orkut. The variance in the vector lengths allows AllPairs to upper-bound the similarity better and thus prune away more false positives, and in addition the exact similarity computations that AllPairs does are faster when the average vector length is smaller. However, BayesLSH and BayesLSH-Lite enable speedups on both AllPairs and LSH, not only when each algorithm is slow, but even when each algorithm is already fast.

5. The accuracy of BayesLSH's similarity estimates is much more consistent as compared to the standard LSH approximation, as can be seen from Table [4]. LSH generally produces too many errors when the threshold is low and too few errors when the threshold is high. This is mainly because LSH uses the same number of hashes (set to 2048) for estimating all similarities, low and high. This problem would persist even if the number of hashes was set to some other value, as explained in Section [3.1]. BayesLSH, on the other hand, maintains similar accuracies at both low and high thresholds, without requiring any tuning at all on the number of hashes to be compared, and only based on the user's specification of the desired accuracy using the $\delta$, $\gamma$ parameters.



6. LSH Approx is often much faster than LSH with exact similarity calculations, especially for datasets with higher average vector lengths, where the speedup is often 3x or more - on Twitter, the speedup is as much as 10x (see Figure 3(f)).

7. BayesLSH does not enable speedups that are as significant for AllPairs in the case of binary vectors. We found that this was because AllPairs was already doing a very good job at generating a small candidate set, thus not leaving much room for improvement. In contrast, LSH was still generating a large candidate set, leaving room for LSH+BayesLSH to enable speedups. Interestingly, even though LSH generates about 10 times more candidates than AllPairs, the LSH variants of BayesLSH are about 50-100% faster than AllPairs and its BayesLSH versions, on WikiWords500K and Twitter (see Figures 3(g), 3(i)). This is because LSH is a faster indexing and candidate generation strategy, especially when the average vector length is large.

8. PPJoin+ is often the fastest algorithm at the highest thresholds (see Figures 3(g)-3(l)), but its performance degrades very rapidly with lower thresholds. A possible explanation is that the pruning heuristics used in PPJoin+ are effective only at higher thresholds.

5.3 Effect of varying parameters of BayesLSH 

We next examine the effect of varying the parameters of BayesLSH - namely the accuracy parameters $\gamma$, $\delta$ and the recall parameter $\epsilon$. We vary each parameter from 0.01 to 0.09 in increments of 0.02, while fixing the other two parameters to 0.05, and fix the dataset to WikiWords100K and threshold to 0.7 (cosine similarity). The effect of varying each of these parameters on the execution time is plotted in Figure [2]. Varying the recall parameter $\epsilon$ and the accuracy parameter $\gamma$ have barely any effect on the running time - however setting $\delta$ to lower values does increase the running time significantly. Why does lowering $\delta$ penalize the running time much more than lowering $\gamma$? This is because lowering $\delta$ increases the number of hashes that have to be compared for all result pairs, while lowering $\gamma$ increases the number of hashes that have to be compared only for those result pairs that have uncertain similarity estimates. It is interesting to note that even though $\delta = 0.01$ requires 2691 secs, it achieves a very low mean error of 0.001, while being much faster than LSH exact, which requires 6586 secs. Approximate LSH requires 883 secs but is much more error-prone, with a mean error of 0.014. With $\gamma = 0.01$, BayesLSH achieves a mean error of 0.013, while still being around 2x faster than approximate LSH.

In Table [5] we show the result of varying these parameters on the output quality. When varying a parameter, we show the change in output quality only for the relevant quality metric - e.g. for changing $\gamma$ we only show how the fraction of errors > 0.05 changes, since we find that recall is largely unaffected by changes in $\gamma$ and $\delta$ (which is as it should be). Looking at the column corresponding to varying $\gamma$, we find that the fraction of errors > 0.05 increases as we expect it to when we increase $\gamma$, without ever exceeding $\gamma$ itself. When varying $\delta$, we can see that the mean error reduces as expected for lower values of $\delta$. Finally, when varying the recall parameter $\epsilon$, we find that the recall reduces with higher values of $\epsilon$ as expected, with the false negative rate always less than $\epsilon$ itself.

Figure 3: Timing comparisons between different algorithms. Missing lines/points are due to the respective algorithm not finishing within the allotted time (50 hours).

Figure 4: BayesLSH can prune the vast majority of false positive candidate pairs by examining only a small number of hashes, resulting in major gains in the running time. Recoverable panel titles: (b) WikiLinks, t=0.7, Cosine; (c) WikiWords100K, t=0.7, Binary Cosine.



Dataset | Fastest BayesLSH variant | Speedup vs. AP | Speedup vs. LSH | Speedup vs. LSH Approx | Speedup vs. PPJoin
Tf-Idf, Cosine
RCV1 | LSH + BayesLSH | 7.?x | 4.8x | 2.4x | -
WikiWords100K | LSH + BayesLSH | 31.4x | 15.1x | 2.0x | -
WikiWords500K | LSH + BayesLSH | > 42.1x | > 13.3x | 2.8x | -
WikiLinks | AP + BayesLSH-Lite | 1.8x | > 248.2x | > 246.3x | -
Orkut | AP + BayesLSH-Lite | 1.2x | > 114.9x | > 155.6x | -
Twitter | LSH + BayesLSH | 26.7x | 33.4x | 3.0x | -
Binary, Jaccard
WikiWords500K | LSH + BayesLSH | 2.0x | > 16.8x | 3.7x | 5.2x
Orkut | AP + BayesLSH-Lite | 0.8x | 2.9x | 2.8x | 1.1x
Twitter | LSH + BayesLSH | 1.8x | 48.4x | 4.2x | 8.0x
Binary, Cosine
WikiWords500K | LSH + BayesLSH | 2.3x | > 10.2x | 1.2x | 5.6x
Orkut | AP + BayesLSH-Lite | 0.8x | > 201x | > 201x | 1.0x
Twitter | AP + BayesLSH-Lite | 1.2x | 27.4x | 1.2x | 3.7x

Table 2: Fastest BayesLSH variant for each dataset (based on total time across all thresholds), and speedups over each baseline. BayesLSH variants are fastest in all cases except for binary versions of Orkut, where it is only slightly sub-optimal. The range of thresholds for Cosine was 0.5 to 0.9, and for Jaccard was 0.3 to 0.7. PPJoin is only applicable to binary datasets. In some cases, only a lower bound on the speedup is available as the baselines timed out (indicated with >).

Figure 2: Effect of varying $\gamma$, $\delta$, $\epsilon$ separately on the running time of LSH+BayesLSH. The dataset is WikiWords100K, with the threshold fixed at t=0.7 for cosine similarity. For comparison, the times for LSH Approx and LSH are also shown. Plot legend: Varying gamma, Varying delta, Varying epsilon, LSH Approx, LSH.



Dataset | t=0.5 | t=0.6 | t=0.7 | t=0.8 | t=0.9
AllPairs+BayesLSH
RCV1 | 97.97 | 98.18 | 98.47 | 99.08 | 99.36
WikiWords100K | 98.52 | 98.84 | 99.2 | 98.58 | 96.69
WikiWords500K | 97.54 | 97.82 | 98.21 | 98.16 | 96.66
WikiLinks | 97.45 | 98.04 | 98.46 | 98.68 | 99.18
Orkut | 97.1 | 97.8 | 98.86 | 99.84 | 99.99
Twitter | 97.7 | 96 | 96.88 | 97.33 | 98.77
AllPairs+BayesLSH-Lite
RCV1 | 98.73 | 98.82 | 98.89 | 99.26 | 99.55
WikiWords100K | 98.88 | 99.31 | 99.62 | 99.69 | 99.5
WikiWords500K | 98.79 | 98.72 | 98.98 | 98.74 | 98.83
WikiLinks | 98.53 | 98.91 | 99.16 | 99.18 | 99.45
Orkut | 98.4 | 98.64 | 99.3 | 99.87 | 99.99
Twitter | 99.44 | 98.82 | 97.17 | 97.18 | 99.06

Table 3: Recalls (out of 100) of AllPairs+BayesLSH and AllPairs+BayesLSH-Lite across different datasets and different similarity thresholds.



Dataset | t=0.5 | t=0.6 | t=0.7 | t=0.8 | t=0.9
LSH Approx
RCV1 | 7.8 | 4.3 | 2.25 | 0.8 | 0.04
WikiWords100K | 4.7 | 3.6 | 1 | 0.3 | 0.02
WikiWords500K | 8.3 | 5.7 | 2.9 | 0.9 | 0.1
WikiLinks | - | - | 1.6 | 0.4 | 0.06
Orkut | - | - | - | - | 0.0072
Twitter | 4 | 5.1 | 2.6 | 0.4 | 0.02
LSH + BayesLSH
RCV1 | 3.2 | 2.9 | 3.2 | 2 | 1.4
WikiWords100K | 2.7 | 2.3 | 3.5 | 4.9 | 2.2
WikiWords500K | 3.4 | 3.4 | 3.2 | 2.9 | 2.1
WikiLinks | 2.96 | 2.82 | 2.3 | 2 | 1.6
Orkut | - | - | 1.5 | 0.6 | 0.09
Twitter | 2.3 | 4 | 3.1 | 4.8 | 4.3

Table 4: Percentage of similarity estimates with errors greater than 0.05; comparison between LSH Approx and LSH + BayesLSH.

Parameter value | Fraction errors > 0.05 for varying $\gamma$ | Mean error for varying $\delta$ | Recall for varying $\epsilon$
0.01 | 0.7% | 0.001 | 98.76%
0.03 | 2% | 0.01 | 97.79%
0.05 | 3% | 0.017 | 97.33%
0.07 | 4.2% | 0.022 | 96.06%
0.09 | 5.4% | 0.027 | 95.35%

Table 5: The effect of varying the parameters $\gamma$, $\delta$, $\epsilon$ one at a time, while fixing the other two parameters at 0.05. The dataset is WikiWords100K, with the threshold fixed at t=0.7; the candidate generation algorithm was LSH.

6. CONCLUSIONS AND FUTURE WORK

In this article, we have presented BayesLSH (and a simple variant BayesLSH-Lite), a general candidate verification and similarity estimation algorithm for approximate similarity search, which combines Bayesian inference with LSH in a principled manner and has a number of advantages compared to standard similarity estimation using LSH. BayesLSH enables significant speedups for two state-of-the-art candidate generation algorithms, AllPairs and LSH, across a wide variety of datasets, and furthermore the quality of BayesLSH is easy to tune. As can be seen from Table [2], a BayesLSH variant is typically the fastest algorithm on a variety of datasets and similarity measures.

BayesLSH takes a largely orthogonal direction to a lot of recent research in LSH, which concentrates on more effective indexing strategies, ultimately with the goal of candidate generation, such as Multi-probe LSH [17] and LSB-trees [23]. Furthermore, a lot of research on LSH is concentrated on nearest-neighbor retrieval for distance measures, rather than all-pairs similarity search with a similarity threshold t.

There are two promising avenues for future research with BayesLSH. First is to extend BayesLSH for similarity search with learned (kernelized) metrics, since such similarity measures are often superior for complex domains [12]. Secondly, we believe that a BayesLSH-Lite analogue can be developed for candidate pruning in the case of nearest neighbor retrieval for Euclidean distances (although the final distance may have to be calculated exactly).

Acknowledgments: We thank Luis Rademacher and anonymous reviewers for helpful comments, and Roberto Bayardo for clarifications on the AllPairs implementation. This work is supported in part by the following NSF grants: IIS-1141828 and IIS-0917070.

Figure 5: Even very different prior distributions converge to very similar posteriors after examining a small number of outcomes (hashes). Panels: (a) Prior distributions; (b) Posterior after examining 32 hashes, with 24 agreements; (c) Posterior after examining 64 hashes, with 48 agreements; (d) Posterior after examining 128 hashes, with 96 agreements.

APPENDIX

A. THE INFLUENCE OF PRIOR VS. DATA

In this section, we show how the observed outcomes (i.e. hashes) are much more influential in determining the posterior distribution than the prior itself. Even if we start with very different prior distributions, the posterior distributions typically become very similar after observing a surprisingly small number of outcomes.

Consider the similarity measure we worked with in the case of cosine similarity, $r(x,y) = 1 - \frac{\theta(x,y)}{\pi}$, which ranges over [0.5, 1] - note that $r(x,y) = 0.5$ corresponds to an actual cosine similarity of 0 between x, y. Consider three very different prior distributions for this similarity measure, as follows (the normalization constants have been omitted):

• Negatively sloped power law prior: $p(s) \propto s^{-3}$

• Uniform prior: $p(s) \propto 1$

• Positively sloped power law prior: $p(s) \propto s^{3}$

In Figure [5], we show the posteriors for each of these three priors after observing a hypothetical series of outcomes for a pair of points x, y with cosine similarity 0.70, corresponding to $r(x,y) = 0.75$. Although to start off with, the three priors are very different (see Figure 5(a)), the posteriors are already quite close after observing only 32 hashes and 24 agreements (Figure 5(b)), and the posteriors get closer quickly with increasing number of hashes (Figures 5(c) and 5(d)).

In general, the likelihood term - which, after observing n hashes with m agreements, is $s^m (1 - s)^{n-m}$ - is much more sharply concentrated than a justifiable prior, very quickly as we increase n. In other words, a prior would itself have to be very sharply concentrated for it to match the influence of the likelihood - using sharply concentrated priors however brings the danger of not letting the data speak for themselves.
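The convergence of posteriors under different priors can be checked numerically. The sketch below evaluates, on a midpoint grid over [0.5, 1], the posterior mean obtained from each of the three power-law priors (exponents -3, 0 and 3) combined with the binomial likelihood, and measures how far apart the three means are. The grid size, the use of the posterior mean as a summary, and the function names are choices made for this illustration and are not taken from the paper.

```python
import math


def posterior_mean(prior_exponent, m, n, grid_size=2000):
    """Posterior mean over [0.5, 1] for a prior proportional to s**prior_exponent.

    The posterior is proportional to s**(prior_exponent + m) * (1 - s)**(n - m);
    weights are kept in log-space and rescaled by their maximum.
    """
    step = 0.5 / grid_size
    points = [0.5 + (i + 0.5) * step for i in range(grid_size)]
    log_w = [(prior_exponent + m) * math.log(s) + (n - m) * math.log(1.0 - s)
             for s in points]
    peak = max(log_w)
    weights = [math.exp(lw - peak) for lw in log_w]
    return sum(s * w for s, w in zip(points, weights)) / sum(weights)


def spread_of_means(m, n, exponents=(-3, 0, 3)):
    """Gap between the largest and smallest posterior mean across the priors."""
    means = [posterior_mean(a, m, n) for a in exponents]
    return max(means) - min(means)


if __name__ == "__main__":
    for m, n in [(0, 0), (24, 32), (48, 64), (96, 128)]:
        print(n, round(spread_of_means(m, n), 4))
```

Running the loop shows the gap between the three posterior means shrinking as more hashes are observed at a fixed agreement rate of 0.75, which is the qualitative behaviour described for Figure 5.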



